Initial Look at Data

Dataset

Below are 100 randomly selected rows from the dataset.

Data Summary

The table below shows several metrics calculated for each column (variable): the number of unique values, the number of NAs, the maximum value, the minimum value, and the mean.
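As a rough illustration of how these per-column metrics can be computed, here is a minimal Python sketch. It is not the code behind the table. The helper name `summarize_column`, the use of `None` to stand in for NA, the choice to exclude NAs from the unique count, and the sample values are all assumptions made for the example.

```python
from statistics import mean


def summarize_column(values):
    """Return unique count, NA count, min, max, and mean for one column.

    Assumption: None marks a missing value (NA). NAs are excluded from the
    unique count, and min/max/mean use only the non-missing numeric values.
    """
    present = [v for v in values if v is not None]
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "n_unique": len(set(present)),
        "n_na": len(values) - len(present),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
    }


# Made-up values, for illustration only (not taken from the dataset).
sample = [3, 7, None, 7, 10]
summary = summarize_column(sample)
print(summary)
```

For a non-numeric column the same helper still reports the unique and NA counts and leaves the minimum, maximum, and mean empty.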

Categorical Variables in Bar Charts

Numeric Variables in Histograms

Time Series Variables

Outliers and Notes

Categorical Variable Notes

A closer look at dna_visittrafficsubtype shows that many of the subtypes occur only rarely in this dataset. Grouping or combining them in a meaningful way may help, but I doubt I have enough information or experience yet to group the levels of this variable.
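If no domain-driven grouping is available, one common fallback is to lump every level that appears fewer than some threshold number of times into a single catch-all level. The sketch below shows that idea only. It is frequency-based, not the meaningful grouping described above, and the helper name `lump_rare_levels`, the threshold, the "Other" label, and the sample values are all hypothetical.

```python
from collections import Counter


def lump_rare_levels(values, min_count, other_label="Other"):
    """Replace levels seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Made-up subtype labels, for illustration only.
subtypes = ["a"] * 5 + ["b"] * 4 + ["c", "d"]
lumped = lump_rare_levels(subtypes, min_count=3)
print(Counter(lumped))
```

The threshold is a judgment call: set too high, it hides levels that may carry signal, and set too low, it leaves the sparse levels in place.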

Time Series Variables (improved)

After removing the outlier dates (noted above) for ordercreatedate, we can better see the general trend.

After removing the NAs from dnatestactivationdayid, we can better see the general trend.
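Both cleanups amount to a filter applied before plotting: drop the missing values and keep only dates inside a plausible window. A minimal sketch follows, assuming the values have already been parsed into `datetime.date` objects and that `None` marks an NA. The helper name `keep_plausible_dates`, the window bounds, and the sample dates are hypothetical and are not the cutoffs used in this analysis.

```python
from datetime import date


def keep_plausible_dates(dates, start, end):
    """Drop missing values and any date outside the window [start, end]."""
    return [d for d in dates if d is not None and start <= d <= end]


# Made-up dates, for illustration only.
raw = [
    date(1900, 1, 1),    # implausibly early, treated as an outlier
    date(2015, 6, 1),
    None,                # missing value (NA)
    date(2016, 3, 15),
    date(2099, 12, 31),  # implausibly late, treated as an outlier
]
clean = keep_plausible_dates(raw, start=date(2013, 1, 1), end=date(2017, 12, 31))
print(clean)
```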

Cross-sell Percentage

Daily Trend

The variance appears to tighten in 2016-2017, and the obvious drop from late 2016 into 2017 will cause problems for most models. Forecasting or prediction could prove difficult if the model isn’t able to account for the sudden drop.

A more detailed view of this daily xsell conversion may help us understand what is influencing this behavior and how that might affect model construction.
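For reference, a daily cross-sell percentage can be computed as the share of records on each day that are flagged as a cross-sell. The sketch below assumes one record per order, given as a (day, flag) pair with a 0/1 flag. That layout, the helper name `daily_xsell_pct`, and the sample rows are assumptions for illustration, not a description of how the chart above was built.

```python
from collections import defaultdict
from datetime import date


def daily_xsell_pct(rows):
    """Map each day to the percentage of its records flagged as cross-sell.

    rows: iterable of (day, xsell_flag) pairs, where xsell_flag is 0 or 1.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for day, flag in rows:
        totals[day] += 1
        if flag:
            hits[day] += 1
    return {day: 100.0 * hits[day] / totals[day] for day in sorted(totals)}


# Made-up records, for illustration only.
rows = [
    (date(2016, 11, 1), 1), (date(2016, 11, 1), 0),
    (date(2016, 11, 1), 0), (date(2016, 11, 1), 1),
    (date(2016, 11, 2), 0), (date(2016, 11, 2), 1),
    (date(2016, 11, 2), 0), (date(2016, 11, 2), 0),
]
pct = daily_xsell_pct(rows)
print(pct)
```

Keeping the daily record count alongside the percentage is worth considering, since days with few records produce noisy percentages that can look like changes in variance.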

I do not yet understand the regtenure column.